1 Executive Summary

Education. Work. Money: three words almost guaranteed to be at the forefront of a student’s mind as they contemplate their future upon leaving high school. With more and more opportunities to pursue almost any degree in any field, the inevitable question of “Which major should I take?” is becoming harder and harder for students across the world to answer.

In this research project, data from over 6.7 million college graduates in the USA has been analysed to examine three key questions:

  • Which college major should a student take to receive the highest income?
  • Which college major should a student pursue to receive the greatest prospects of employment?
  • How do these “rankings” align with the popularity of these courses?

Data shows that Engineering majors often see the highest levels of income, followed by other major categories such as Business and Law. In particular, Petroleum Engineering stands above the rest of the majors with a median income of USD$110,000, compared to a median of USD$36,000 across all majors.

Further, UNEMPLOYMENT.

Additionally, POPULARITY.



2 Full Report

2.1 Initial Data Analysis (IDA)

2.1.1 Domain Knowledge

  1. In 2012, the median personal income in the US was $28,213 (U.S. Bureau of the Census, 2017) and the unemployment rate was 8.1% (Bureau of Labor Statistics, 2019).

  2. Recently, a research article has shown that “since the [GFC]… students have turned away from the humanities and towards job-oriented degrees” (Kopf, 2018), with the share of degrees in history dropping from 2% in 2007 to 1% in 2017 (Kopf, 2018). This seems to reflect “a new set of student priorities… formed even before they see the inside of a college classroom… Students [are] fleeing humanities and related fields specifically because they think they have poor job prospects.” (Schmidt, 2018).

2.1.2 Source

The data was collected from the American Community Survey 2010-2012 Public Use Microdata Sample (PUMS) files on the US Census Bureau website. It was initially wrangled by media company FiveThirtyEight (a part of ABC News Internet Ventures), with the code accessible on GitHub (FiveThirtyEight, 2014).

The US Bureau of the Census is a government body, and although FiveThirtyEight had commercial interests, their process of data wrangling was highly transparent and reproducible. Therefore, these sources can be considered reliable.
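The report does not show how `gradData` was read in, so the following is only a minimal sketch of one way it could be loaded. The function name `load_grad_data` and the file path are hypothetical; the two sample rows are written to a temporary file purely to make the example self-contained.

```r
# Hypothetical loader: the function name and path are assumptions, not the report's code.
load_grad_data <- function(path) {
  # stringsAsFactors = TRUE reproduces the Factor columns shown by str();
  # it was the default before R 4.0.0 and must be set explicitly afterwards.
  read.csv(path, stringsAsFactors = TRUE)
}

# Tiny stand-in file (two rows only) so the call can be run without the real data
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Rank,Major_code,Major,Major_category,Median",
  "1,2419,PETROLEUM ENGINEERING,Engineering,110000",
  "7,6202,ACTUARIAL SCIENCE,Business,62000"
), tmp)

gradData_demo <- load_grad_data(tmp)
str(gradData_demo)
```

Pointing `load_grad_data()` at a local copy of the FiveThirtyEight recent-graduates file should give a data frame with the structure listed in the data dictionary below, assuming the file is unchanged since the report was written.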

2.1.3 Stakeholders

The Census Bureau produces the PUMS as an inexpensive and accessible data source for students and social scientists, while FiveThirtyEight wrangled this data for commercial use in their article The Economic Guide to Picking a College Major, aimed at educating students on how to choose their college majors.

Drawing upon the domain knowledge, this data is particularly relevant to high school leavers trying to choose a major, as well as current students contemplating their career prospects, as it may help them make a more informed economic decision.

University staff, internship firms, or other organisations may also benefit from predicting the future direction of the workforce, allowing for better resource allocation, such as investment into engineering and STEM fields.

2.1.4 Data Dictionary

str(gradData)
## 'data.frame':    173 obs. of  21 variables:
##  $ Rank                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Major_code          : int  2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
##  $ Major               : Factor w/ 173 levels "ACCOUNTING","ACTUARIAL SCIENCE",..: 141 116 113 132 24 134 2 15 109 53 ...
##  $ Total               : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
##  $ Men                 : int  2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
##  $ Women               : int  282 77 131 135 11021 373 1667 960 10907 16016 ...
##  $ Major_category      : Factor w/ 16 levels "Agriculture & Natural Resources",..: 8 8 8 8 8 8 4 14 8 8 ...
##  $ ShareWomen          : num  0.121 0.102 0.153 0.107 0.342 ...
##  $ Sample_size         : int  36 7 3 16 289 17 51 10 1029 631 ...
##  $ Employed            : int  1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
##  $ Full_time           : int  1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
##  $ Part_time           : int  270 170 133 150 5180 264 296 553 13101 12695 ...
##  $ Full_time_year_round: int  1207 388 340 692 16697 1449 2482 827 54639 41413 ...
##  $ Unemployed          : int  37 85 16 40 1672 400 308 33 4650 3895 ...
##  $ Unemployment_rate   : num  0.0184 0.1172 0.0241 0.0501 0.0611 ...
##  $ Median              : int  110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
##  $ P25th               : int  95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
##  $ P75th               : int  125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
##  $ College_jobs        : int  1534 350 456 529 18314 1142 1768 972 52844 45829 ...
##  $ Non_college_jobs    : int  364 257 176 102 4440 657 314 500 16384 10874 ...
##  $ Low_wage_jobs       : int  193 50 0 0 972 244 259 220 3253 3170 ...

This data consists of 20 variables (excluding “Rank”, which orders the subjects by median income); however, only xx variables are relevant for the study:

Data Classifications

Major_code

A unique code for each major, given by the source.
Type: Integer
Assessment: Although it is a number, a factor classification would be more suitable as the codes are considered nominal (no order).

Major

The major’s name.
Type: Factor
Assessment: Either a character or factor classification would be suitable.

Total, Men, Women

Total number of people, men, and women respectively with that major in the sample for 2010-2012.
Type: Integer
Assessment: Suitable.

Major_category

General category for that major (e.g. “Engineering”).
Type: Factor
Assessment: Suitable - allows for easy classification and plotting.

ShareWomen

Women as a proportion of Total.
Type: Number
Assessment: Suitable. It is stored as a decimal between 0 and 1 (multiply by 100 if plotting percentages).

Sample_size

Sample size for calculating income quartiles.
Type: Integer
Assessment: Suitable.

Employed, Full_time, Part_time

Number of people employed, employed 35 hours or more per week, and employed fewer than 35 hours per week respectively.
Type: Integer
Assessment: Suitable.

Full_time_year_round

Number of people employed for at least 50 weeks per year and at least 35 hours per week.
Type: Integer
Assessment: Suitable.

Unemployed

Number of people considered unemployed by census data.
Type: Integer
Assessment: Suitable.

Unemployment_rate

The proportion of people unemployed, calculated as Unemployed / (Unemployed + Employed).
Type: Number
Assessment: Suitable.

Median, P25th, P75th

Median, 25th percentile, and 75th percentile earnings respectively for full-time, year-round workers (in USD).
Type: Integer
Assessment: Suitable - although income is continuous, it can be considered discrete without significantly impacting the data.

College_jobs, Non_college_jobs, Low_wage_jobs

Number of people with a job requiring a college degree, not requiring a college degree, and in a low-wage service job respectively.
Type: Integer
Assessment: Suitable.
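The assessments above can be checked directly. The sketch below uses only the first three rows of the `str()` output (ranks 1 to 3); the data frame name `grads` is introduced here for illustration. It converts `Major_code` to a factor, as recommended, and recomputes `ShareWomen` and `Unemployment_rate` from their definitions.

```r
# First three rows, copied from the str() output above
grads <- data.frame(
  Major_code = c(2419L, 2416L, 2415L),
  Total      = c(2339L, 756L, 856L),
  Women      = c(282L, 77L, 131L),
  Employed   = c(1976L, 640L, 648L),
  Unemployed = c(37L, 85L, 16L)
)

# Major codes are labels rather than quantities, so store them as a factor
grads$Major_code <- factor(grads$Major_code)

# Recompute the two derived columns from their definitions
share_women       <- grads$Women / grads$Total
unemployment_rate <- grads$Unemployed / (grads$Unemployed + grads$Employed)

round(share_women, 3)        # 0.121 0.102 0.153
round(unemployment_rate, 4)  # 0.0184 0.1172 0.0241
```

Both recomputed columns agree with the values printed by `str()`, which supports reading them as proportions rather than percentages.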

2.1.5 Data Assessment

Possible Issues:

  • The data consists of pre-summarised information, where each subject is a set of medians, percentages, etc. Therefore, care needs to be taken in plotting and drawing conclusions.
  • The data comes from an observational study, lacking a control group and randomised allocation between majors. Therefore, any conclusions drawn must be treated as association rather than causation.
  • The data is 7 years old, and income has not been adjusted for inflation. However, this is not a major issue considering relative comparisons are being made.
  • The data has been pre-filtered to only include subjects below the age of 28. However, this can be positive in that it is more relevant to current university students.


Validity:
This data, taking into account the issues above and their solutions, can be considered valid. However, care must be taken to acknowledge confounders, such as personality and circumstance, rather than just major choice, in influencing the variables.

2.2 Research Question 1

Which college major should a student take to receive the highest income?

There are three variables to consider - the 25th percentile, median, and 75th percentile incomes. Additionally, it is important to consider both individual majors and major categories. Taking a summary initially shows that there is a significant range of incomes:

summary(gradData$Median)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   33000   36000   40151   45000  110000
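library(plotly)  # provides plot_ly(), layout() and the %>% pipe
# Builds a horizontal reference line spanning the full plot width at height y
hline <- function(y = 0) {
  list(
    type = "line", 
    x0 = 0, 
    x1 = 1, 
    xref = "paper",
    y0 = y, 
    y1 = y, 
    line = list(color = "red", width = 1)
  )
}
# Box plot of median income per major category; the red line marks the overall median
plot_ly(gradData, y = ~Median/1000, color = ~Major_category, type = "box") %>% 
  layout(    
    yaxis = list(title = "Median income (USD$1000)"),
    xaxis = list(showticklabels = FALSE),
    title = "Median Income per Major Category",
    shapes = list(hline(median(gradData$Median) / 1000)))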

Plotting the median income against major category backs up the summary - showing a large spread, centred around the median of $36,000.
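As a numeric companion to the box plot, the overall median and the per-category medians can be tabulated with base R. This is a minimal sketch on made-up illustrative values (the data frame `toy` is not the report's data); on the real data the same calls would be applied to `gradData`.

```r
# Illustrative values only, not the full data set
toy <- data.frame(
  Major_category = c("Engineering", "Engineering", "Engineering",
                     "Education", "Education", "Business"),
  Median = c(110000, 65000, 60000, 34000, 32000, 62000)
)

# Overall median: the value drawn as the red reference line
overall_median <- median(toy$Median)

# Median of the major-level medians within each category
by_category <- aggregate(Median ~ Major_category, data = toy, FUN = median)

# Sort categories from highest to lowest
by_category[order(-by_category$Median), ]
```

Because each row is already a median for one major, the per-category figure is a median of medians and does not account for how many graduates each major has.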

# Selects top 10
gradData.head = head(gradData, n=10)
# Creates a new data frame to be able to plot median, 25th, and 75th percentiles on the same graph
median.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category, 
                       Income = gradData.head$Median)
p25.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category, 
                       Income = gradData.head$P25th)
p75.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category, 
                       Income = gradData.head$P75th)
gradData.head.df = rbind(median.df, p25.df, p75.df)
# Selects bottom 10
gradData.tail = tail(gradData, n=10)
# Creates a new data frame to be able to plot median, 25th, and 75th percentiles on the same graph
median.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category, 
                       Income = gradData.tail$Median)
p25.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category, 
                       Income = gradData.tail$P25th)
p75.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category, 
                       Income = gradData.tail$P75th)
gradData.tail.df = rbind(median.df, p25.df, p75.df)
# Repeats the bottom-10 rows three times so they line up with the stacked rows of gradData.tail.df
score = rbind(gradData.tail, gradData.tail, gradData.tail)
# Combines the top-10 and bottom-10 data frames
gradData.combined.df = rbind(gradData.head.df, gradData.tail.df)
# Does the same for the top 10, giving one Median value per plotted row to order the majors by
score = rbind(gradData.head, gradData.head, gradData.head, score)
# Plots a boxplot, with majors ordered by descending median income
plot_ly(gradData.combined.df, y=~Income/1000, x=~reorder(Major, -score$Median), color=~Major_category, type="box") %>%
  layout(    
    yaxis = list(
      title = "Income (USD$1000)",
      autotick = FALSE,
      ticks = "outside",
      tick0 = 0,
      dtick = 10,
      ticklen = 3,
      tickwidth = 1),
    xaxis = list(
      showticklabels = TRUE, title="",
      tickangle = 270, tickfont = list(size = 10)),
    title = "Top 10 and Bottom 10 Majors by Median Income")

With 173 individual majors, there are too many data points to interpret at once. Instead, ordering the data by median income, the subjects can be limited to the top 10 and bottom 10 majors (note that the plotted data includes the median, 25th percentile, and 75th percentile of each major). While 8 of the top 10 majors belong to the Engineering category, the bottom 10 majors are considerably more varied.
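The stacking of the three income columns is repeated several times in the code. A small helper can do the same job in one place; the sketch below is an illustrative alternative, not the report's method, and the name `stack_income` and the extra `Quantile` column are assumptions introduced here. The two rows are ranks 1 and 7 from the `str()` output.

```r
# Hypothetical helper: stacks P25th, Median and P75th into a single Income column
stack_income <- function(df) {
  cols <- c("P25th", "Median", "P75th")
  do.call(rbind, lapply(cols, function(col) {
    data.frame(Major          = df$Major,
               Major_category = df$Major_category,
               Quantile       = col,        # records which column the value came from
               Income         = df[[col]])
  }))
}

# Ranks 1 and 7, with values copied from the str() output
toy <- data.frame(
  Major          = c("PETROLEUM ENGINEERING", "ACTUARIAL SCIENCE"),
  Major_category = c("Engineering", "Business"),
  P25th  = c(95000, 53000),
  Median = c(110000, 62000),
  P75th  = c(125000, 72000)
)

long <- stack_income(toy)
long
```

Applied as `stack_income(rbind(head(gradData, 10), tail(gradData, 10)))`, this should yield the same 60 income values as the combined data frame built above, though in a different row order.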


combined = gradData
median.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45), 
                       Income = combined$Median)
p25.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45), 
                       Income = combined$P25th)
p75.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45), 
                       Income = combined$P75th)
combined.df = rbind(median.df, p25.df, p75.df)
ggplotly(ggplot(combined.df, aes(x=Income/1000, fill=Major_category)) + geom_density(alpha=0.2) + facet_wrap(~Major_category) +
           xlab("Income (USD$1000)") + ylab("Density") + labs(title="Income Density per Major Category") + theme_minimal() + theme(legend.position="none", strip.text.x = element_text(size = 7)))
combined = rbind(gradData[gradData$Major_category=="Engineering",],
                 gradData[gradData$Major_category=="Education",], gradData[gradData$Major_category=="Business",])
median.df = data.frame(Major = combined$Major, Major_category=combined$Major_category, 
                       Income = combined$Median)
p25.df = data.frame(Major = combined$Major, Major_category=combined$Major_category, 
                       Income = combined$P25th)
p75.df = data.frame(Major = combined$Major, Major_category=combined$Major_category, 
                       Income = combined$P75th)
combined.df = rbind(median.df, p25.df, p75.df)
ggplotly(ggplot(combined.df, aes(x=Income/1000, fill=Major_category)) + geom_density(alpha=0.2) +
           xlab("Income (USD$1000)") + ylab("Density") + labs(fill="Major Category", title="Income Density per Major Category (Selection)") + theme_minimal())

Examining the density estimation of a selection of major categories, again Engineering appears to have significantly higher incomes compared to other categories. However, the estimation shows that the spread is also significantly larger, with a portion of the income falling within the range of the lowest majors. This is contrasted with Education, where the range is confined to ~$25,000.

# Coefficient of Variation for the Engineering sample's incomes
paste("Coefficient of Variation for Engineering:", sd(combined.df[combined.df$Major_category=="Engineering",]$Income)/
  mean(combined.df[combined.df$Major_category=="Engineering",]$Income)) 
## [1] "Coefficient of Variation for Engineering: 0.329608913424962"
# Coefficient of Variation for the Education sample's incomes
paste("Coefficient of Variation for Education:", sd(combined.df[combined.df$Major_category=="Education",]$Income)/
  mean(combined.df[combined.df$Major_category=="Education",]$Income)) 
## [1] "Coefficient of Variation for Education: 0.20893468637973"

This is reinforced by the coefficient of variation for Engineering (0.33) being roughly 1.6 times that of Education (0.21).
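The coefficient of variation is the standard deviation divided by the mean, which makes it unitless and comparable between groups with different income levels. A minimal sketch follows, with a hypothetical helper `cv` and made-up income vectors; only the two reported values in the final calculation come from the output above.

```r
# Coefficient of variation: spread relative to the mean (hypothetical helper)
cv <- function(x) sd(x) / mean(x)

# Made-up incomes for illustration: one wide group and one narrow group
wide   <- c(40000, 60000, 110000)
narrow <- c(30000, 33000, 36000)

cv(wide)    # larger: values are spread far from their mean
cv(narrow)  # smaller: values sit close to their mean

# Rescaling (e.g. dollars to thousands of dollars) leaves the result unchanged
all.equal(cv(wide), cv(wide / 1000))

# Ratio of the two values reported above for Engineering and Education
cv_ratio <- 0.329608913424962 / 0.20893468637973
round(cv_ratio, 2)  # 1.58
```

Note that the report's values are computed on the stacked 25th percentile, median, and 75th percentile figures, so they describe spread across both majors and quartiles within a category.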

Summary:
The data shows that Engineering incomes can far exceed those in other categories, with Petroleum Engineering in particular being significantly higher than the other majors. Indeed, the separation of Petroleum Engineering from the other top 10 median incomes is comparable to the separation of the top 10 from the bottom 10. However, Engineering incomes overall have a significantly larger spread than the other categories, implying volatility either between majors or within the industries themselves. Nevertheless, students seeking high incomes may be best served by looking towards Engineering fields.

2.3 Research Question 2

Which college major should a student pursue to see the greatest prospects for employment?


Insert text and analysis.

percentCollege = gradData$College_jobs/(gradData$College_jobs+gradData$Non_college_jobs+gradData$Low_wage_jobs)
ggplot(gradData, aes(x=factor(Major_category), y=percentCollege)) + 
  geom_boxplot(aes(fill = factor(Major_category))) +
  theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(fill = "Major Category", title="Percent College Jobs per Major Category") + 
  ylab("Percentage of Jobs as College Jobs")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
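One plausible cause of the warning is a major with zero jobs recorded in all three columns, since 0/0 evaluates to `NaN` in R. This is an assumption; the report does not identify the row. The sketch below reproduces the effect on made-up rows (the third row is invented to trigger it) and shows how the offending major could be located.

```r
# Rows 1 and 2 use values from the str() output; row 3 is invented for illustration
toy <- data.frame(
  Major            = c("A", "B", "C"),
  College_jobs     = c(1534L, 350L, 0L),
  Non_college_jobs = c(364L, 257L, 0L),
  Low_wage_jobs    = c(193L, 50L, 0L)
)

# Same calculation as percentCollege above, stored as a column
job_total <- toy$College_jobs + toy$Non_college_jobs + toy$Low_wage_jobs
toy$percentCollege <- toy$College_jobs / job_total

# Majors whose ratio is not finite; these are the rows a boxplot would drop
dropped <- as.character(toy$Major[!is.finite(toy$percentCollege)])
dropped
```

Running the same `!is.finite()` filter on `gradData` would show which major is removed from the plot.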


How much of this employment is actually in their field / based on the degree (looking at college jobs vs non-college jobs)?


2.4 Research Question 3

Looking at the results from Q1 and Q2, how do these “rankings” align with the popularity of these courses?

(based on total people = employed + unemployed)

Insert text and analysis.

Summary:


3 References

Bureau of Labor Statistics. (2019). Labor Force Statistics from the Current Population Survey, 2012 (LNU04000000) [Data set]. Retrieved from http://data.bls.gov.

Casselman, B. (2014, September 12). The Economic Guide to Picking a College Major. FiveThirtyEight. Retrieved from https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/.

FiveThirtyEight. (2014). College Majors 2010-2012 (Recent Grads) [Data set]. Retrieved from GitHub; https://github.com/fivethirtyeight/data/tree/master/college-majors.

Kopf, D. (2018, August 29). The 2008 financial crisis completely changed what majors students choose. Quartz. Retrieved from https://qz.com/1370922/the-2008-financial-crisis-completely-changed-what-majors-students-choose/.

Schmidt, B. (2018, August 3). The Humanities Are in Crisis. The Atlantic. Retrieved from https://www.theatlantic.com/ideas/archive/2018/08/the-humanities-face-a-crisisof-confidence/567565/.

U.S. Bureau of the Census. (2018). Public Use Microdata Samples (PUMS) Documentation. Retrieved from https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html.

U.S. Bureau of the Census. (2017). Real Median Personal Income in the United States, 2012 (MEPAINUSA672N) [Data set]. Retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/MEPAINUSA672N.

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2 plotly_4.8.0   ggplot2_3.1.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0         RColorBrewer_1.1-2 later_0.8.0       
##  [4] pillar_1.3.1       compiler_3.5.2     plyr_1.8.4        
##  [7] bindr_0.1.1        tools_3.5.2        digest_0.6.18     
## [10] viridisLite_0.3.0  jsonlite_1.6       evaluate_0.12     
## [13] tibble_2.0.1       gtable_0.2.0       pkgconfig_2.0.2   
## [16] rlang_0.3.1        shiny_1.2.0        crosstalk_1.0.0   
## [19] yaml_2.2.0         xfun_0.4           withr_2.1.2       
## [22] dplyr_0.7.8        stringr_1.4.0      httr_1.4.0        
## [25] knitr_1.21         htmlwidgets_1.3    grid_3.5.2        
## [28] tidyselect_0.2.5   glue_1.3.0         data.table_1.12.0 
## [31] R6_2.3.0           rmarkdown_1.11     tidyr_0.8.2       
## [34] purrr_0.3.0        magrittr_1.5       promises_1.0.1    
## [37] scales_1.0.0       htmltools_0.3.6    assertthat_0.2.0  
## [40] xtable_1.8-3       mime_0.6           colorspace_1.4-0  
## [43] httpuv_1.4.5.1     labeling_0.3       stringi_1.2.4     
## [46] lazyeval_0.2.1     munsell_0.5.0      crayon_1.3.4